ci: populate baselines and gate benchmarks at +20% by clean6378-max-it · Pull Request #91 · cppalliance/claude-code-chat-browser

clean6378-max-it · 2026-06-17T20:38:10Z

Closes #83

Summary

Adds a +20% benchmark regression gate to CI, populates benchmarks/baselines.json from a post-cache run, and documents how to refresh baselines.

Sprint item #6 (3 pt) — Wednesday PR 2 of 2. Depends on PR 1 (#90 session cache) so baselines reflect cached-path performance.

Problem

The CI benchmarks job ran pytest tests/benchmarks/ --benchmark-only and uploaded artifacts, but was labeled informational with no threshold. benchmarks/baselines.json had empty groups, so parse/export/search regressions passed silently.

Changes

`scripts/check_benchmark_regression.py` (new)

Compares pytest-benchmark JSON means against benchmarks/baselines.json
Fails when current_mean / baseline_mean > 1.20
Missing baseline for a new benchmark name: warn, do not fail
Missing current result for a baseline entry: warn and skip

`scripts/reduce_baselines.py` (new)

Reduces --benchmark-json output into baselines.json (parse / export / search groups only)
Optional --slack factor for cross-machine capture (e.g. faster local host vs ubuntu CI)

`benchmarks/baselines.json`

Populated with 7 gated benchmark means (3 parse, 3 export, 1 search)
Captured post-cache; cache group benchmarks are informational only (not gated)

CI (`.github/workflows/ci.yml`)

Job renamed: Performance benchmarks (informational) → Performance benchmarks (gated)
New step after benchmark run:
python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json
Gate runs on ubuntu-latest only

`Makefile` (new)

make update-baselines — run benchmarks and regenerate baselines.json
make check-benchmarks — local gate check

Tests

tests/test_check_benchmark_regression.py — missing-baseline warn (pass), >20% regression (fail), within threshold (pass)

Docs / hygiene

benchmarks/README.md — gate behavior and baseline refresh instructions
.gitignore — benchmark-results.json, benchmarks/_raw.json

Acceptance criteria

benchmarks/baselines.json populated from post-cache run
CI fails on injected >20% regression; passes when green
Missing-baseline case warns without failing (tested)
make update-baselines regenerates baselines
Job renamed to gated label

Test plan

pytest -q — 388 passed
pytest tests/test_check_benchmark_regression.py -q
python scripts/check_benchmark_regression.py benchmarks/_raw.json benchmarks/baselines.json
ruff check .

Notes

Baselines were captured locally with --slack 1.35 for ubuntu headroom. If the gated CI job fails on first run due to runner variance, refresh on ubuntu via make update-baselines and recommit.
Branch protection admins may want to mark the gated benchmarks job as a required check.

Summary by CodeRabbit

New Features
- Added a CI performance regression gate for selected benchmarks, failing builds when gated averages exceed stored baselines beyond the allowed threshold.
Chores
- Introduced benchmark tooling to update baselines, run regression checks locally, and clean generated benchmark artifacts.
- Updated CI to run regression checks before uploading benchmark artifacts.
Documentation
- Refreshed benchmark README with gate rules and baseline refresh instructions.
Tests
- Added coverage for baseline reduction and regression-gate behavior, including warnings and error handling.
Style
- Updated ignore rules for generated benchmark output files.

Add check_benchmark_regression.py and reduce_baselines.py; populate benchmarks/baselines.json from post-cache run; wire regression gate into ubuntu benchmarks job; add Makefile update-baselines target and tests.

coderabbitai · 2026-06-17T20:38:26Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a benchmark regression gate to CI by introducing two new scripts (check_benchmark_regression.py and reduce_baselines.py) that implement ratio-based regression detection and baseline generation. benchmarks/baselines.json is populated with measured values from a post-cache Ubuntu run. The CI workflow gains a gated regression-check step, a Makefile provides update-baselines and check-benchmarks targets, and .gitignore and benchmarks/README.md are updated accordingly.

Changes

Benchmark Regression Gate

Layer / File(s)	Summary
Regression check script implementation `scripts/check_benchmark_regression.py`	Defines `THRESHOLD = 1.20` and introduces `BenchmarkDataError` exception for malformed JSON. `load_results()` parses the `benchmarks` array with strict validation of `name` and `stats.mean` fields, converting means to floats. `load_baseline_means()` extracts baselines from a `groups` object and flattens them into `name -> baseline_mean` mappings. `check_regression()` computes `ratio = current_mean / baseline_mean`, prints `[ok]`/`[FAIL]` per benchmark, skips ratio checks when baseline is zero, warns (exit 0) for missing baselines or ungated benchmarks, and exits 1 if any gated ratio exceeds threshold. CLI entrypoint `main()` wires argparse-driven execution and converts `BenchmarkDataError` to stderr output with exit code 2.
Regression check tests `tests/test_check_benchmark_regression.py`	Comprehensive pytest coverage including helpers for writing benchmark result and baseline JSON fixtures. Tests verify `check_regression` return codes and stdout messages for missing-baseline warnings (exit 0), regressions exceeding 20% (exit 1), within-threshold results (exit 0), zero-baseline skip behavior, exactly-at-threshold success, and missing current results. Validation tests confirm malformed JSON and missing `benchmarks` array raise `BenchmarkDataError` with expected patterns, and CLI error reporting exits with code 2 and stderr output.
Baseline reduction script `scripts/reduce_baselines.py`	Defines `GATED_GROUPS = ("parse", "export", "search")` for filtering benchmarks. Introduces `_positive_float` validator to enforce `--slack > 0`. `reduce_baselines()` reads raw pytest-benchmark JSON, validates structure (requiring `benchmarks` array with `name` and `stats.mean`), filters entries to gated groups, scales extracted means by slack factor, and writes baselines.json containing `_note`, ISO-UTC `updated` timestamp, `machine` (from `machine_info.system`), and `groups` mapping. CLI `main()` wires argument parsing, handles `BenchmarkDataError` with exit code 2, and returns 0 on success.
Baseline reduction tests `tests/test_reduce_baselines.py`	Pytest tests validating `reduce_baselines()` behavior: filters output to only gated groups while excluding non-gated entries, scales baseline means by the `slack` parameter, rejects input JSON missing the `benchmarks` array, and CLI rejects non-positive `--slack` values with `SystemExit`.
CI integration and baseline data `.github/workflows/ci.yml`, `Makefile`, `.gitignore`, `benchmarks/README.md`, `benchmarks/baselines.json`	CI `benchmarks` job step renamed from "informational" to "gated"; new "Regression gate" step runs `check_benchmark_regression.py` on `benchmark-results.json` and `benchmarks/baselines.json` before artifact upload. `Makefile` adds `update-baselines` target (pytest `--benchmark-only` to `benchmarks/_raw.json` then `reduce_baselines.py`) and `check-benchmarks` target (pytest `--benchmark-only` to `benchmark-results.json` then `check_benchmark_regression.py`); includes `clean-benchmark-artifacts` to remove generated files. `.gitignore` excludes `benchmark-results.json` and `benchmarks/_raw.json`. README documents the +20% gate on ubuntu-latest, warning-only behavior for benchmarks without baselines, and `make update-baselines` refresh workflow with guidance that baselines must be captured on ubuntu-latest for consistency. `baselines.json` populated with measured parse/export/search timings, corrected `machine = "Linux"`, and `updated = "2026-06-17T21:00:00Z"` metadata.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

cppalliance/claude-code-chat-browser#76: Introduced the CI benchmarks job with --benchmark-only and artifact upload using the same benchmark-results.json file that this PR's regression gate consumes.

Suggested reviewers

timon0305
wpak-ai

Poem

🐇 Hoppity-hop through the benchmark land,
Where baselines are measured and gates firmly stand.
If your mean drifts too high — past twenty percent —
The CI cries FAIL and refuses consent.
But new benchmarks just warn with a friendly "take care,"
A gentle-eared rabbit watches performance with flair! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 4.35% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'ci: populate baselines and gate benchmarks at +20%' directly and specifically describes the main change—adding a +20% regression gate and populating baseline data in CI.
Linked Issues check	✅ Passed	The PR implements all core requirements from issue `#83`: populates baselines, gates on +20% threshold with graceful missing-baseline handling, documents via README and Makefile, renames CI job, gates ubuntu-latest only, and includes unit tests.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to benchmark gating infrastructure: scripts, CI workflow, baseline data, documentation, testing, and developer tooling—with no unrelated modifications to session cache or other features.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch ci/benchmark-regression-gate

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

scripts/check_benchmark_regression.py (2)

12-27: ⚡ Quick win

Consider adding defensive error handling for malformed input.

Both load_results and load_baseline_means assume well-formed JSON with the expected schema (e.g., data["benchmarks"], entry["stats"]["mean"]). While pytest-benchmark should always produce correct output, missing keys or malformed JSON would raise unhelpful KeyError or TypeError exceptions in CI.

Adding try/except blocks with actionable error messages would improve debuggability when investigating CI failures.

🛡️ Proposed defensive improvements

 def load_results(results_path: str | Path) -> dict[str, float]:
-    data = json.loads(Path(results_path).read_text(encoding="utf-8"))
-    return {entry["name"]: float(entry["stats"]["mean"]) for entry in data["benchmarks"]}
+    try:
+        data = json.loads(Path(results_path).read_text(encoding="utf-8"))
+        return {entry["name"]: float(entry["stats"]["mean"]) for entry in data["benchmarks"]}
+    except (KeyError, TypeError) as e:
+        print(f"ERROR: malformed results JSON at {results_path}: {e}", file=sys.stderr)
+        raise

Similar pattern for load_baseline_means.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/check_benchmark_regression.py` around lines 12 - 27, Both
load_results and load_baseline_means functions lack error handling for malformed
JSON input and missing expected keys. Add try/except blocks around the JSON
parsing and data access logic in both functions to catch KeyError and TypeError
exceptions. When these exceptions occur, raise a new exception with an
actionable error message that includes the file path and details about which
expected key or structure was missing. This will provide meaningful error
context instead of unhelpful KeyError exceptions when CI receives malformed
JSON.

45-45: ⚡ Quick win

Add guard against division by zero.

If a baseline mean is 0, line 45 will raise ZeroDivisionError. While benchmarks with zero mean are implausible, defensive validation would prevent cryptic CI failures.

🛡️ Proposed guard

+        if base == 0:
+            print(f"WARN: baseline {name!r} has zero mean; skipping")
+            continue
         ratio = cur / base

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/check_benchmark_regression.py` at line 45, The ratio calculation at
the line where ratio is assigned as cur divided by base lacks protection against
division by zero. Add a guard condition to check if the base value is zero
before performing the division operation. If base is zero, handle it
appropriately by either skipping the ratio calculation, assigning a special
value, or logging a warning. This defensive check will prevent ZeroDivisionError
exceptions even though zero baseline means are unlikely in practice.

scripts/reduce_baselines.py (1)

44-49: ⚡ Quick win

Validate --slack parameter bounds.

The --slack argument accepts any float, including negative or zero values, which would produce incorrect baseline means. Adding validation would prevent user errors.

♻️ Proposed validation

     parser.add_argument(
         "--slack",
         type=float,
         default=1.0,
         help="multiply means by this factor (e.g. 1.25 when capturing on a faster host)",
     )
     args = parser.parse_args(argv)
+    if args.slack <= 0:
+        parser.error("--slack must be positive")
     reduce_baselines(args.raw_path, args.out_path, slack=args.slack)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/reduce_baselines.py` around lines 44 - 49, The `--slack` argument
parser definition lacks validation to ensure the parameter value is valid for
use as a multiplier. Add a constraint to the `--slack` argument definition to
validate that the value is greater than zero, rejecting negative or zero values
that would produce incorrect baseline means. You can use either the `choices`
parameter with a custom validation function or leverage the `type` parameter
with a callable validator to enforce this constraint.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/baselines.json`:
- Line 4: The "machine" field in benchmarks/baselines.json indicates the
baselines were captured on Windows, but the CI gates require ubuntu-latest
baselines to prevent cross-OS timing variance from causing false
positives/negatives. Recapture the baselines on an ubuntu environment by either
running "make update-baselines" after the ubuntu environment is available, or
manually running pytest with "--benchmark-only
--benchmark-json=benchmarks/_raw.json" flags followed by executing the
scripts/reduce_baselines.py script to generate the correct ubuntu-based
baselines.json file.

---

Nitpick comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 12-27: Both load_results and load_baseline_means functions lack
error handling for malformed JSON input and missing expected keys. Add
try/except blocks around the JSON parsing and data access logic in both
functions to catch KeyError and TypeError exceptions. When these exceptions
occur, raise a new exception with an actionable error message that includes the
file path and details about which expected key or structure was missing. This
will provide meaningful error context instead of unhelpful KeyError exceptions
when CI receives malformed JSON.
- Line 45: The ratio calculation at the line where ratio is assigned as cur
divided by base lacks protection against division by zero. Add a guard condition
to check if the base value is zero before performing the division operation. If
base is zero, handle it appropriately by either skipping the ratio calculation,
assigning a special value, or logging a warning. This defensive check will
prevent ZeroDivisionError exceptions even though zero baseline means are
unlikely in practice.

In `@scripts/reduce_baselines.py`:
- Around line 44-49: The `--slack` argument parser definition lacks validation
to ensure the parameter value is valid for use as a multiplier. Add a constraint
to the `--slack` argument definition to validate that the value is greater than
zero, rejecting negative or zero values that would produce incorrect baseline
means. You can use either the `choices` parameter with a custom validation
function or leverage the `type` parameter with a callable validator to enforce
this constraint.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 575f2d48-af8a-4cd9-9998-d03f8ce771b6

📥 Commits

Reviewing files that changed from the base of the PR and between 047efe8 and 12bf70c.

📒 Files selected for processing (8)

.github/workflows/ci.yml
.gitignore
Makefile
benchmarks/README.md
benchmarks/baselines.json
scripts/check_benchmark_regression.py
scripts/reduce_baselines.py
tests/test_check_benchmark_regression.py

Raise BenchmarkDataError for malformed JSON or missing keys; skip zero baselines safely; reject non-positive --slack values.

coderabbitai

🧹 Nitpick comments (1)

tests/test_check_benchmark_regression.py (1)
65-77: ⚡ Quick win

Add an explicit boundary test for exactly +20% (ratio == 1.20).

Current coverage checks above and below threshold, but not the exact cutoff. Since gating is defined as current / baseline > 1.20, add a case (e.g., 0.00012 vs 0.0001) asserting exit code 0 to lock in boundary semantics.

Also applies to: 80-107
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_check_benchmark_regression.py` around lines 65 - 77, The current
test coverage for the check_regression function does not include an explicit
test for the exact boundary condition where the performance ratio equals 1.20
(the threshold). Add a new test function that verifies the exact boundary case,
such as using a current value of 0.00012 against a baseline of 0.0001, which
produces exactly a 1.20 ratio, and assert that check_regression returns 0
(passes). This locks in the boundary semantics and ensures the gating logic
defined as current / baseline > 1.20 is correct at the exact cutoff point.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/test_check_benchmark_regression.py`:
- Around line 65-77: The current test coverage for the check_regression function
does not include an explicit test for the exact boundary condition where the
performance ratio equals 1.20 (the threshold). Add a new test function that
verifies the exact boundary case, such as using a current value of 0.00012
against a baseline of 0.0001, which produces exactly a 1.20 ratio, and assert
that check_regression returns 0 (passes). This locks in the boundary semantics
and ensures the gating logic defined as current / baseline > 1.20 is correct at
the exact cutoff point.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7390034-4231-4d61-b70c-0a328b1bbef3

📥 Commits

Reviewing files that changed from the base of the PR and between b9d0707 and 2482951.

📒 Files selected for processing (3)

scripts/check_benchmark_regression.py
scripts/reduce_baselines.py
tests/test_check_benchmark_regression.py

🚧 Files skipped from review as they are similar to previous changes (2)

scripts/reduce_baselines.py
scripts/check_benchmark_regression.py

clean6378-max-it · 2026-06-17T20:53:15Z

@CodeRabbit, please review this PR in more depth

coderabbitai · 2026-06-17T20:55:03Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== scripts/check_benchmark_regression.py ==="
cat scripts/check_benchmark_regression.py

echo ""
echo "=== scripts/reduce_baselines.py ==="
cat scripts/reduce_baselines.py

echo ""
echo "=== tests/test_check_benchmark_regression.py ==="
cat tests/test_check_benchmark_regression.py

echo ""
echo "=== benchmarks/baselines.json ==="
cat benchmarks/baselines.json

echo ""
echo "=== Makefile ==="
cat Makefile

echo ""
echo "=== .github/workflows/ci.yml ==="
cat .github/workflows/ci.yml

echo ""
echo "=== benchmarks/README.md ==="
cat benchmarks/README.md

echo ""
echo "=== .gitignore (last 20 lines) ==="
tail -20 .gitignore

Length of output: 21437

@clean6378-max-it Sure! Here is a detailed review of PR #91.

Deep Review: PR `#91` — Benchmark Regression Gate

🔴 Critical

1. `ci.yml` — Benchmark artifact lost on regression failure

The artifact upload step runs after the regression gate step. When the gate fails (exit 1), GitHub Actions skips subsequent steps by default, so benchmark-results.json is never uploaded — exactly when you need it most for debugging.

# .github/workflows/ci.yml
- name: Regression gate
  run: python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json

- uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4
  if: always()   # ← add this
  with:
    name: benchmark-results
    path: benchmark-results.json

2. `reduce_baselines.py` — Unhandled `KeyError` on malformed input

raw["benchmarks"] is accessed without a try/except. A malformed JSON file (no "benchmarks" key, missing entry["name"], or missing entry["stats"]["mean"]) will produce a bare traceback rather than a diagnostic message:

# reduce_baselines.py — current
for entry in raw["benchmarks"]:
    group = entry.get("group")
    if group not in GATED_GROUPS:
        continue
    groups[group][entry["name"]] = float(entry["stats"]["mean"]) * slack

check_benchmark_regression.py wraps the same access in BenchmarkDataError; reduce_baselines.py should do the same:

try:
    entries = raw["benchmarks"]
except (KeyError, TypeError) as exc:
    raise BenchmarkDataError(f"raw file missing top-level 'benchmarks' array") from exc
for index, entry in enumerate(entries):
    try:
        name = entry["name"]
        mean = float(entry["stats"]["mean"])
    except (KeyError, TypeError) as exc:
        raise BenchmarkDataError(f"benchmarks[{index}] missing 'name' or 'stats.mean'") from exc
    ...

3. `check_benchmark_regression.py` — `BenchmarkDataError` uncaught in `main()`

main() calls check_regression() directly without catching BenchmarkDataError. If the JSON is malformed, the user gets an unformatted Python traceback instead of a clean error message:

def main(argv: list[str] | None = None) -> int:
    argv = sys.argv[1:] if argv is None else argv
    if len(argv) != 2:
        ...
        return 2
    try:
        return check_regression(argv[0], argv[1])
    except BenchmarkDataError as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        return 2

🟡 Medium

4. `load_baseline_means` — Fragile fallback that silently discards data

The line groups = data.get("groups", data) falls back to the entire root dict when "groups" is absent. This means _note, updated, and machine are iterated as group names; they're skipped because their values aren't dicts, but it masks a real misconfiguration. A missing groups key should be an error:

# current — fragile
groups = data.get("groups", data)
if not isinstance(groups, dict):
    raise BenchmarkDataError(f"{path} missing 'groups' object")

# suggested — explicit
if "groups" not in data:
    raise BenchmarkDataError(f"{path} missing required 'groups' key")
groups = data["groups"]
if not isinstance(groups, dict):
    raise BenchmarkDataError(f"{path} 'groups' must be an object")

5. Missing tests for `reduce_baselines.py`

scripts/reduce_baselines.py has zero tests. Given it writes committed baseline data, it should have at minimum:

Scenario	Expected
Valid raw JSON → correct `baselines.json` structure	Pass
Missing `"benchmarks"` key	`BenchmarkDataError` (once added)
`--slack 0` or `--slack -1`	`argparse.ArgumentTypeError`
Non-gated group entries filtered out	Absent from output
`machine_info` present → `machine` field populated	Populated correctly

6. Missing test for the exact 1.20× boundary

The gate condition is ratio > threshold (strict), so a ratio of exactly 1.20 must pass. This is not tested:

def test_exactly_at_threshold_passes(tmp_path) -> None:
    results = tmp_path / "results.json"
    baselines = tmp_path / "baselines.json"
    _write_results(results, [{"name": "test_parse_session_small", "stats": {"mean": 0.000120}}])
    _write_baselines(baselines, {"parse": {"test_parse_session_small": 0.0001}})
    assert check_regression(results, baselines) == 0  # 1.20x is not > 1.20

7. Missing test for "no current result for baseline" warning path

test_missing_baseline_warns_without_failing tests the case where a result has no matching baseline (WARN path). The opposite case — baseline entry exists but result is absent — is never tested:

def test_missing_current_result_warns_without_failing(tmp_path, capsys) -> None:
    results = tmp_path / "results.json"
    baselines = tmp_path / "baselines.json"
    _write_results(results, [])  # empty
    _write_baselines(baselines, {"parse": {"test_parse_session_small": 0.0001}})
    assert check_regression(results, baselines) == 0
    assert "no current result for baseline" in capsys.readouterr().out

8. `main()` does not expose `--threshold` as a CLI argument

The THRESHOLD = 1.20 constant is not overridable from the command line. For a tool designed to be run in CI, this makes it harder to locally test different thresholds or temporarily loosen the gate for a branch:

parser.add_argument("--threshold", type=float, default=THRESHOLD,
                    help="regression ratio threshold (default: 1.20)")

🟢 Low / Informational

9. `baselines.json` — Timestamp appears hand-crafted

The "updated" field reads "2026-06-17T21:00:00Z" — a suspiciously round timestamp. reduce_baselines.py uses datetime.now(UTC) which produces sub-second precision. If this was captured by a real CI run, the timestamp should look like "2026-06-17T21:03:47Z". If the values were manually authored, they may not reflect actual post-cache performance on ubuntu-latest.

10. `benchmarks/README.md` — Misleading phrasing

"Do not capture baselines on Windows/macOS for commit — ubuntu runners are slower and the gate will fail."

"ubuntu runners are slower" contradicts the intent. The actual reason is that baselines must match the gated runner OS (ubuntu-latest). Consider:

"Baselines must be captured on ubuntu-latest to match the CI runner. Cross-OS variance will cause spurious failures."

11. `Makefile` — No `clean` target

After running make update-baselines or make check-benchmarks, benchmarks/_raw.json and benchmark-results.json are left on disk (both in .gitignore). A clean target would be ergonomic:

clean:
	rm -f benchmarks/_raw.json benchmark-results.json

Summary table

#	Severity	File	Issue
1	🔴	`ci.yml`	Artifact upload skipped when regression gate fails
2	🔴	`reduce_baselines.py`	No error handling on malformed input
3	🔴	`check_benchmark_regression.py`	`BenchmarkDataError` uncaught in `main()`
4	🟡	`check_benchmark_regression.py`	`load_baseline_means` fallback silently discards data on missing `groups` key
5	🟡	`tests/`	No tests for `reduce_baselines.py`
6	🟡	`tests/`	No test for exact 1.20× boundary
7	🟡	`tests/`	No test for missing-current-result WARN path
8	🟡	`check_benchmark_regression.py`	`--threshold` not exposed as CLI argument
9	🟢	`baselines.json`	Timestamp looks hand-crafted, not from real CI run
10	🟢	`benchmarks/README.md`	Misleading "ubuntu runners are slower" phrasing
11	🟢	`Makefile`	No `clean` target

The most impactful fix to make immediately is #1 (if: always() on the artifact upload step) — losing the benchmark JSON on a regression failure defeats the purpose of the gate.

Upload benchmark-results.json even when the gate fails; add BenchmarkDataError handling to reduce_baselines; require explicit groups key; expose --threshold CLI flag; add reduce_baselines and boundary tests; clarify README.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

scripts/check_benchmark_regression.py (1)

34-41: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wrap stats.mean numeric conversion in BenchmarkDataError.

At Line 41, float(mean) can raise ValueError/TypeError and escape uncaught, causing a traceback instead of the CLI’s structured error path (ERROR: + exit code 2).

Suggested fix

         try:
             name = entry["name"]
             mean = entry["stats"]["mean"]
         except (KeyError, TypeError) as exc:
             raise BenchmarkDataError(
                 f"{path} benchmarks[{index}] missing 'name' or 'stats.mean'"
             ) from exc
-        results[str(name)] = float(mean)
+        try:
+            results[str(name)] = float(mean)
+        except (TypeError, ValueError) as exc:
+            raise BenchmarkDataError(
+                f"{path} benchmarks[{index}] has non-numeric 'stats.mean'"
+            ) from exc

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/check_benchmark_regression.py` around lines 34 - 41, The conversion
of mean to float using float(mean) on line 41 can raise ValueError or TypeError
exceptions that are not caught by the existing try-except block, causing
unhandled exceptions instead of properly formatted CLI errors. Move the results
assignment statement containing float(mean) inside the try block (before the
except clause) so that any conversion errors are caught and wrapped in
BenchmarkDataError along with the other data validation errors.

scripts/reduce_baselines.py (1)

58-63: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate machine_info shape before calling .get().

If machine_info is present but not an object, machine_info.get("system") raises AttributeError and bypasses your structured error handling.

Suggested fix

-    machine_info = raw.get("machine_info", {})
+    machine_info = raw.get("machine_info", {})
+    if machine_info is not None and not isinstance(machine_info, dict):
+        raise BenchmarkDataError(f"{path} 'machine_info' must be an object when present")
@@
-        "machine": machine_info.get("system"),
+        "machine": machine_info.get("system") if isinstance(machine_info, dict) else None,

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/reduce_baselines.py` around lines 58 - 63, The code calls
machine_info.get("system") without validating that machine_info is actually a
dictionary, which will raise an AttributeError if machine_info is not a dict
object. Before calling .get() on machine_info, add a type check to ensure it is
a dictionary instance, and provide a safe fallback value (such as None) if the
validation fails. This will ensure the structured error handling is not bypassed
by unexpected data types in the machine_info field.

🧹 Nitpick comments (1)

tests/test_reduce_baselines.py (1)

77-78: ⚡ Quick win

Assert the SystemExit code for stricter CLI validation.

Line 77 currently accepts any SystemExit, which can hide unrelated parse failures. Assert exc.value.code == 2 so the test specifically verifies argparse rejection for non-positive --slack.

Proposed test tightening

-    with pytest.raises(SystemExit):
-        main([str(raw), str(tmp_path / "out.json"), "--slack", "0"])
+    with pytest.raises(SystemExit) as exc:
+        main([str(raw), str(tmp_path / "out.json"), "--slack", "0"])
+    assert exc.value.code == 2

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_reduce_baselines.py` around lines 77 - 78, The
pytest.raises(SystemExit) on the main function call with the `--slack` argument
currently accepts any SystemExit without validating the specific exit code.
Modify the context manager to capture the exception info using `as exc_info` and
then assert that `exc_info.value.code == 2` after the main function call
completes, ensuring the test strictly validates that argparse is rejecting the
non-positive slack value (exit code 2) rather than failing for unrelated
reasons.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/reduce_baselines.py`:
- Around line 29-33: The code catches json.JSONDecodeError but does not handle
OSError that can be raised by path.read_text() and path.write_text() methods,
allowing raw tracebacks to leak for common CLI failures like missing files or
permission issues. Wrap path.read_text() calls in a try-except block that
catches both json.JSONDecodeError and OSError, converting OSError to
BenchmarkDataError with the error details included in the message. Apply the
same pattern to all path.write_text() calls in the script (at lines 65-66 and
81-85) to ensure consistent error handling throughout and prevent
filesystem-related exceptions from propagating as raw tracebacks.

---

Outside diff comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 34-41: The conversion of mean to float using float(mean) on line
41 can raise ValueError or TypeError exceptions that are not caught by the
existing try-except block, causing unhandled exceptions instead of properly
formatted CLI errors. Move the results assignment statement containing
float(mean) inside the try block (before the except clause) so that any
conversion errors are caught and wrapped in BenchmarkDataError along with the
other data validation errors.

In `@scripts/reduce_baselines.py`:
- Around line 58-63: The code calls machine_info.get("system") without
validating that machine_info is actually a dictionary, which will raise an
AttributeError if machine_info is not a dict object. Before calling .get() on
machine_info, add a type check to ensure it is a dictionary instance, and
provide a safe fallback value (such as None) if the validation fails. This will
ensure the structured error handling is not bypassed by unexpected data types in
the machine_info field.

---

Nitpick comments:
In `@tests/test_reduce_baselines.py`:
- Around line 77-78: The pytest.raises(SystemExit) on the main function call
with the `--slack` argument currently accepts any SystemExit without validating
the specific exit code. Modify the context manager to capture the exception info
using `as exc_info` and then assert that `exc_info.value.code == 2` after the
main function call completes, ensuring the test strictly validates that argparse
is rejecting the non-positive slack value (exit code 2) rather than failing for
unrelated reasons.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e6f0d19f-683d-41e8-a1eb-2f97136d0bc1

📥 Commits

Reviewing files that changed from the base of the PR and between 2482951 and b4bf963.

📒 Files selected for processing (7)

.github/workflows/ci.yml
Makefile
benchmarks/README.md
scripts/check_benchmark_regression.py
scripts/reduce_baselines.py
tests/test_check_benchmark_regression.py
tests/test_reduce_baselines.py

✅ Files skipped from review due to trivial changes (1)

benchmarks/README.md

🚧 Files skipped from review as they are similar to previous changes (2)

Makefile
.github/workflows/ci.yml

Wrap reduce_baselines I/O in BenchmarkDataError; guard machine_info type; move float(mean) into try block in load_results; assert argparse exit code 2.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

scripts/reduce_baselines.py (1)
11-11: ⚠️ Potential issue | 🟠 Major

Fix import mode mismatch at Line 11 to prevent startup crashes.

from scripts.check_benchmark_regression import BenchmarkDataError fails with ModuleNotFoundError: No module named 'scripts' when this file is run as python scripts/reduce_baselines.py. The scripts directory lacks an __init__.py file and the absolute import does not work in direct script execution mode, despite this being the documented invocation pattern in Makefile and benchmarks/README.md.
Suggested fix
-from scripts.check_benchmark_regression import BenchmarkDataError
+if __package__:
+    from .check_benchmark_regression import BenchmarkDataError
+else:
+    from check_benchmark_regression import BenchmarkDataError
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/reduce_baselines.py` at line 11, The absolute import of
BenchmarkDataError from scripts.check_benchmark_regression fails when
reduce_baselines.py is run directly as a script because the scripts directory is
not a Python package. Change the import statement from the absolute import `from
scripts.check_benchmark_regression import BenchmarkDataError` to a relative
import `from .check_benchmark_regression import BenchmarkDataError` since both
files are in the same scripts directory. This will allow the module to be
correctly resolved when the file is executed as a script.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@scripts/reduce_baselines.py`:
- Line 11: The absolute import of BenchmarkDataError from
scripts.check_benchmark_regression fails when reduce_baselines.py is run
directly as a script because the scripts directory is not a Python package.
Change the import statement from the absolute import `from
scripts.check_benchmark_regression import BenchmarkDataError` to a relative
import `from .check_benchmark_regression import BenchmarkDataError` since both
files are in the same scripts directory. This will allow the module to be
correctly resolved when the file is executed as a script.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f630845d-4f13-4730-8970-fe8278b98d3c

📥 Commits

Reviewing files that changed from the base of the PR and between b4bf963 and 6a6cb6e.

📒 Files selected for processing (3)

scripts/check_benchmark_regression.py
scripts/reduce_baselines.py
tests/test_reduce_baselines.py

🚧 Files skipped from review as they are similar to previous changes (2)

tests/test_reduce_baselines.py
scripts/check_benchmark_regression.py

Drop test_parse_session_small and test_search_full_corpus from gated baselines; clarify ubuntu CI provenance in _note; detect duplicate benchmark names; expand reduce_baselines tests.

…benchmark_regression.py

clean6378-max-it · 2026-06-17T23:50:11Z

@CodeRabbit, can you verify if @timon0305 's all comments are addressed?

coderabbitai · 2026-06-17T23:51:17Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== .github/workflows/ci.yml ==="
cat .github/workflows/ci.yml

echo ""
echo "=== scripts/check_benchmark_regression.py ==="
cat scripts/check_benchmark_regression.py

echo ""
echo "=== scripts/reduce_baselines.py ==="
cat scripts/reduce_baselines.py

echo ""
echo "=== tests/test_check_benchmark_regression.py ==="
cat tests/test_check_benchmark_regression.py

echo ""
echo "=== tests/test_reduce_baselines.py ==="
cat tests/test_reduce_baselines.py

echo ""
echo "=== Makefile ==="
cat Makefile

echo ""
echo "=== benchmarks/README.md ==="
cat benchmarks/README.md

echo ""
echo "=== benchmarks/baselines.json ==="
cat benchmarks/baselines.json

Length of output: 30061

@clean6378-max-it Here is a full status check against all 11 issues from the earlier review:

#	Severity	Issue	Status
1	🔴	`ci.yml` — `if: always()` on artifact upload	✅ Fixed
2	🔴	`reduce_baselines.py` — unhandled `KeyError` on malformed input	✅ Fixed
3	🔴	`check_benchmark_regression.py` — `BenchmarkDataError` uncaught in `main()`	✅ Fixed
4	🟡	`load_baseline_means` — fragile fallback when `"groups"` key absent	✅ Fixed
5	🟡	No tests for `reduce_baselines.py`	✅ Fixed (5 tests added in `tests/test_reduce_baselines.py`)
6	🟡	No test for exact 1.20× boundary	✅ Fixed (`test_exactly_at_threshold_passes`)
7	🟡	No test for "no current result for baseline" WARN path	✅ Fixed (`test_missing_current_result_warns_without_failing`)
8	🟡	`--threshold` not exposed as CLI argument	✅ Fixed (added to `argparse` in `main()`)
9	🟢	`baselines.json` — timestamp `"2026-06-17T21:00:00Z"` appears hand-crafted	⚠️ Still present
10	🟢	`benchmarks/README.md` — misleading "ubuntu runners are slower" phrasing	✅ Fixed
11	🟢	`Makefile` — no `clean` target	✅ Fixed (`clean-benchmark-artifacts` target added)

Summary: All 3 critical 🔴 issues and all 5 medium 🟡 issues are fully addressed. 2 of 3 low/informational 🟢 issues are also resolved.

The only remaining item is Issue #9 — baselines.json still has "updated": "2026-06-17T21:00:00Z", which is a suspiciously round timestamp suggesting the file may not have been regenerated from a real CI run via make update-baselines. This is low severity and purely informational, but worth noting if you want the file to reflect a genuine post-cache capture.

ci: populate baselines and gate benchmarks at +20%

12bf70c

Add check_benchmark_regression.py and reduce_baselines.py; populate benchmarks/baselines.json from post-cache run; wire regression gate into ubuntu benchmarks job; add Makefile update-baselines target and tests.

clean6378-max-it self-assigned this Jun 17, 2026

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread benchmarks/baselines.json Outdated

clean6378-max-it added 2 commits June 18, 2026 04:45

fix: set benchmark baselines from ubuntu-latest CI means

b9d0707

fix: harden benchmark gate scripts with input validation

2482951

Raise BenchmarkDataError for malformed JSON or missing keys; skip zero baselines safely; reject non-positive --slack values.

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

fix: harden benchmark gate per PR #91 review

b4bf963

Upload benchmark-results.json even when the gate fails; add BenchmarkDataError handling to reduce_baselines; require explicit groups key; expose --threshold CLI flag; add reduce_baselines and boundary tests; clarify README.

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread scripts/reduce_baselines.py

fix: harden benchmark scripts against filesystem and type errors

6a6cb6e

Wrap reduce_baselines I/O in BenchmarkDataError; guard machine_info type; move float(mean) into try block in load_results; assert argparse exit code 2.

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

fix: allow reduce_baselines to run as a direct script

5d816d5

clean6378-max-it requested a review from timon0305 June 17, 2026 22:17

timon0305 requested changes Jun 17, 2026

View reviewed changes

fix: exclude noisy micro-benchmarks from regression gate

001c5f0

Drop test_parse_session_small and test_search_full_corpus from gated baselines; clarify ubuntu CI provenance in _note; detect duplicate benchmark names; expand reduce_baselines tests.

clean6378-max-it force-pushed the ci/benchmark-regression-gate branch from d8ce6cf to 001c5f0 Compare June 17, 2026 23:30

clean6378-max-it added 3 commits June 18, 2026 07:32

fix: ruff format check_benchmark_regression.pyfix: ruff format check_…

0f55400

…benchmark_regression.py

fix: enforce EXCLUDED_FROM_GATE at check time and harden file reads

bdce5ae

fix: enforce EXCLUDED_FROM_GATE at check time and harden file reads

a79866f

timon0305 approved these changes Jun 17, 2026

View reviewed changes

clean6378-max-it requested a review from wpak-ai June 18, 2026 06:46

wpak-ai approved these changes Jun 18, 2026

View reviewed changes

wpak-ai merged commit ad83f10 into master Jun 18, 2026
16 checks passed

wpak-ai deleted the ci/benchmark-regression-gate branch June 18, 2026 15:00

Conversation

clean6378-max-it commented Jun 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Changes

scripts/check_benchmark_regression.py (new)

scripts/reduce_baselines.py (new)

benchmarks/baselines.json

CI (.github/workflows/ci.yml)

Makefile (new)

Tests

Docs / hygiene

Acceptance criteria

Test plan

Notes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

clean6378-max-it commented Jun 17, 2026

Uh oh!

coderabbitai Bot commented Jun 17, 2026

Deep Review: PR #91 — Benchmark Regression Gate

🔴 Critical

1. ci.yml — Benchmark artifact lost on regression failure

2. reduce_baselines.py — Unhandled KeyError on malformed input

3. check_benchmark_regression.py — BenchmarkDataError uncaught in main()

🟡 Medium

4. load_baseline_means — Fragile fallback that silently discards data

5. Missing tests for reduce_baselines.py

6. Missing test for the exact 1.20× boundary

7. Missing test for "no current result for baseline" warning path

8. main() does not expose --threshold as a CLI argument

🟢 Low / Informational

9. baselines.json — Timestamp appears hand-crafted

10. benchmarks/README.md — Misleading phrasing

11. Makefile — No clean target

Summary table

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

clean6378-max-it commented Jun 17, 2026

Uh oh!

coderabbitai Bot commented Jun 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

clean6378-max-it commented Jun 17, 2026 •

edited by coderabbitai Bot

Loading

`scripts/check_benchmark_regression.py` (new)

`scripts/reduce_baselines.py` (new)

`benchmarks/baselines.json`

CI (`.github/workflows/ci.yml`)

`Makefile` (new)

coderabbitai Bot commented Jun 17, 2026 •

edited

Loading

Deep Review: PR `#91` — Benchmark Regression Gate

1. `ci.yml` — Benchmark artifact lost on regression failure

2. `reduce_baselines.py` — Unhandled `KeyError` on malformed input

3. `check_benchmark_regression.py` — `BenchmarkDataError` uncaught in `main()`

4. `load_baseline_means` — Fragile fallback that silently discards data

5. Missing tests for `reduce_baselines.py`

8. `main()` does not expose `--threshold` as a CLI argument

9. `baselines.json` — Timestamp appears hand-crafted

10. `benchmarks/README.md` — Misleading phrasing

11. `Makefile` — No `clean` target